Markov Decision Process
The Agent–Environment Interface
At each time step the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy, denoted $\pi_t$, where $\pi_t(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$.
In general, actions can be any decision we want to learn how to make, and states can be any interpretation of the world that might inform those actions.
The boundary between the agent and the environment is much closer to the agent than first intuition suggests. E.g. if we are controlling a robot, the voltages and mechanical stresses in its structure are part of the environment, not the agent. Even reward signals are considered part of the environment, despite very possibly being produced inside the agent's own body, e.g. by dopamine.
Markov Property
The future is independent of the past given the present.
- In contrast to normal causal processes, we would expect the state and reward at the next time step to be a function of all previous states, actions and rewards, as follows: $\Pr\{S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\}$.
- If the state is Markov, however, then the current state and action completely characterize the relevant history, and the above can be reduced to $\Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}$.
- The Markov property is important in RL because decisions and values are assumed to be a function of the current state only.
Markov Decision Process (MDP)
From the four-argument dynamics function $p(s', r \mid s, a)$, one can compute anything else one might want to know about the environment, such as the state-transition probabilities,
$$p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\} = \sum_{r} p(s', r \mid s, a).$$
We can also compute the expected rewards for state–action pairs as a two-argument function $r(s, a)$:
$$r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a),$$
and the expected rewards for state–action–next-state triples as a three-argument function $r(s, a, s')$,
$$r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \frac{\sum_{r} r \, p(s', r \mid s, a)}{p(s' \mid s, a)}.$$
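To make these definitions concrete, here is a minimal Python sketch, assuming a toy tabular representation in which `p[(s, a)]` is a list of `(next_state, reward, probability)` triples; the states, rewards, and function names are invented for illustration, not taken from the text.

```python
# Tabular dynamics: p[(s, a)] is a list of (next_state, reward, probability) triples.
# This representation is an assumption made for this sketch only.
p = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 0.0, 1.0)],
    ("s1", "a0"): [("s0", 2.0, 1.0)],
    ("s1", "a1"): [("s1", 0.0, 1.0)],
}

def state_transition_prob(p, s, a, s_next):
    """p(s' | s, a): marginalize the reward out of p(s', r | s, a)."""
    return sum(prob for ns, r, prob in p[(s, a)] if ns == s_next)

def expected_reward(p, s, a):
    """r(s, a): expected next reward for a state-action pair."""
    return sum(r * prob for ns, r, prob in p[(s, a)])

def expected_reward_triple(p, s, a, s_next):
    """r(s, a, s'): expected next reward given the next state as well."""
    trans = state_transition_prob(p, s, a, s_next)
    if trans == 0.0:
        return 0.0  # undefined in this case; return 0 here by convention
    return sum(r * prob for ns, r, prob in p[(s, a)] if ns == s_next) / trans

print(state_transition_prob(p, "s0", "a0", "s1"))   # 0.5
print(expected_reward(p, "s0", "a0"))               # 0.5
print(expected_reward_triple(p, "s0", "a0", "s1"))  # 1.0
```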
Goals and Rewards
Reward Hypothesis
That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Return (Goal)
In general, we seek to maximize the expected return, where the return, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case, the return is the sum of the rewards:
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T,$$
where $T$ is a final time step.
This approach makes sense in applications that naturally come to an end or are periodic; that is, the agent–environment interaction breaks into episodes. We call these episodic tasks, e.g. plays of a board game or trips through a maze.
Notation for the state space in an episodic task varies from the conventional case: we write $\mathcal{S}$ for the set of nonterminal states and $\mathcal{S}^+$ for the set of all states, including the terminal state.
Applications that do not break into episodes are called continuing tasks. For these tasks, we use the discounted return to keep the sum of rewards from going to infinity:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$
where $\gamma \in [0, 1)$ is the discount rate. If the reward is a constant +1, then the return is
$$G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}.$$
Returns at successive time steps are related to each other in a way that is important for the theory and algorithms of reinforcement learning:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma\left(R_{t+2} + \gamma R_{t+3} + \cdots\right) = R_{t+1} + \gamma G_{t+1}.$$
Alternatively, we can write the return in a single notation covering both episodic and discounted cases,
$$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k,$$
including the possibility that $T = \infty$ or $\gamma = 1$ (but not both).
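As a small illustration of the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, here is a hedged sketch that computes every return of a finite reward sequence in one backward pass; the function name and reward values are my own.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for each t from a reward sequence R_1, ..., R_T,
    using the recursion G_t = R_{t+1} + gamma * G_{t+1} (with G_T = 0)."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

# A long constant +1 reward stream approaches 1 / (1 - gamma) = 10 for gamma = 0.9.
print(discounted_returns([1.0] * 100, gamma=0.9)[0])  # ~9.9997
```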
Policies and Value Functions
Policy
A policy is a distribution over actions given states,
$$\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}.$$
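A minimal sketch of a tabular stochastic policy under this definition, assuming a dictionary representation `pi[s][a]` for $\pi(a \mid s)$; the states, actions, and probabilities are made up.

```python
import random

# Tabular stochastic policy: pi[s][a] = pi(a | s). Values are illustrative only.
pi = {
    "s0": {"a0": 0.8, "a1": 0.2},
    "s1": {"a0": 0.5, "a1": 0.5},
}

def sample_action(pi, s):
    """Sample an action a with probability pi(a | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s0"))  # "a0" about 80% of the time
```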
State Value Function
The value function of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_\pi$ formally by
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right],$$
where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$, and $t$ is any time step. Note that the value of the terminal state, if any, is always zero. We call the function $v_\pi$ the state-value function for policy $\pi$.
In recursive form (recall the Bellman equation):
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right].$$
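A minimal iterative policy evaluation sketch: it sweeps the Bellman equation above as an update rule until the values stop changing, reusing the toy dictionary conventions from the earlier sketches (all names and numbers are illustrative, not from the text).

```python
# p[(s, a)] is a list of (next_state, reward, probability) triples; pi[s][a] = pi(a | s).
p = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 0.0, 1.0)],
    ("s1", "a0"): [("s0", 2.0, 1.0)],
    ("s1", "a1"): [("s1", 0.0, 1.0)],
}
pi = {"s0": {"a0": 0.8, "a1": 0.2}, "s1": {"a0": 0.5, "a1": 0.5}}

def policy_evaluation(p, pi, states, gamma=0.9, theta=1e-8):
    """Apply the Bellman equation for v_pi as an update rule until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi_a * prob * (r + gamma * v[ns])
                for a, pi_a in pi[s].items()
                for ns, r, prob in p[(s, a)]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

print(policy_evaluation(p, pi, states=["s0", "s1"]))
```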
Action Value Function
Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right].$$
We call $q_\pi$ the action-value function for policy $\pi$.
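One convenient consequence of these definitions (not stated explicitly above) is that $q_\pi$ can be recovered from $v_\pi$ by a one-step look-ahead over the dynamics, $q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$. A short sketch, with toy inputs of my own:

```python
def q_from_v(p, v, s, a, gamma=0.9):
    """q_pi(s, a) via a one-step look-ahead over p(s', r | s, a) and v_pi."""
    return sum(prob * (r + gamma * v[ns]) for ns, r, prob in p[(s, a)])

# Tiny illustrative inputs (same conventions as the earlier sketches):
p = {("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)]}
v = {"s0": 1.0, "s1": 2.0}
print(q_from_v(p, v, "s0", "a0"))  # 0.5*(0 + 0.9*1.0) + 0.5*(1 + 0.9*2.0) = 1.85
```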
Optimal Policies and Optimal Value Functions
The main objective of an RL algorithm is to find the best behavior an agent can follow so as to maximize the expected return.
The optimal state-value function is the maximum value function over all policies,
$$v_*(s) = \max_{\pi} v_\pi(s).$$
The optimal action-value function is the maximum action-value function over all policies,
$$q_*(s, a) = \max_{\pi} q_\pi(s, a).$$
For the state–action pair $(s, a)$, this function gives the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy. Thus, we can write $q_*$ in terms of $v_*$ as follows:
$$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\right].$$
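Putting the optimality definitions together, here is a hedged value-iteration sketch: it iterates $v(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v(s')\right]$, then reads $q_*$ off via the relation above and extracts a greedy policy. The tabular `p`, states, and actions are invented for illustration.

```python
# Toy dynamics: p[(s, a)] is a list of (next_state, reward, probability) triples.
p = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 0.0, 1.0)],
    ("s1", "a0"): [("s0", 2.0, 1.0)],
    ("s1", "a1"): [("s1", 0.0, 1.0)],
}
states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9

def value_iteration(p, states, actions, gamma=0.9, theta=1e-8):
    """Iterate the Bellman optimality update until the values stop changing."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(prob * (r + gamma * v[ns]) for ns, r, prob in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v

v_star = value_iteration(p, states, actions, gamma)

# q_*(s, a) = E[R_{t+1} + gamma * v_*(S_{t+1}) | S_t = s, A_t = a]
q_star = {
    (s, a): sum(prob * (r + gamma * v_star[ns]) for ns, r, prob in p[(s, a)])
    for s in states for a in actions
}
greedy = {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}
print(v_star, greedy)
```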
Extensions of MDPs
- Infinite and Continuous MDPs
- Partially observable MDPs
- Undiscounted, average reward MDPs